Load the diamonds data set, as well as ggplot2
setwd("/Users/Nick/Documents/udacity/projects/data_analysis_with_r/")
library(ggplot2)
library(gridExtra)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
##
## The following object is masked from 'package:scales':
##
## percent
##
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
##
## The following objects are masked from 'package:base':
##
## as.array, trimws
library(RColorBrewer)
library(RCurl)
## Loading required package: bitops
library(bitops)
data(diamonds)
Create a scatterplot of price vs x for the diamonds data set
ggplot(data=diamonds, aes(x=x, y=price)) + geom_point()
Testing correlations between price and x/y/z
with(diamonds, cor.test(x, price))
##
## Pearson's product-moment correlation
##
## data: x and price
## t = 440.16, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8825835 0.8862594
## sample estimates:
## cor
## 0.8844352
with(diamonds, cor.test(y, price))
##
## Pearson's product-moment correlation
##
## data: y and price
## t = 401.14, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8632867 0.8675241
## sample estimates:
## cor
## 0.8654209
with(diamonds, cor.test(z, price))
##
## Pearson's product-moment correlation
##
## data: z and price
## t = 393.6, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8590541 0.8634131
## sample estimates:
## cor
## 0.8612494
Create a simple scatter plot of price vs depth. Add to the code some transparency in the dots (1/100) and mark the x-axis every 2 units.
ggplot(data=diamonds, aes(x=depth, y=price)) + geom_point(alpha=0.01) +
scale_x_continuous(breaks=seq(40, 80, 2))
The above plot places the vast majority of diamonds between depths of 60 and 64.
What’s the correlation of depth vs. price?
with(diamonds, cor.test(depth, price))
##
## Pearson's product-moment correlation
##
## data: depth and price
## t = -2.473, df = 53938, p-value = 0.0134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.019084756 -0.002208537
## sample estimates:
## cor
## -0.0106474
The correlation between the two is -0.01. This would be a horrible predictive characteristic of diamond price.
Create a scatterplot of price vs carat, and omit the top 1% of price and carat values.
ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point() +
xlim(0, quantile(diamonds$carat, 0.99)) +
ylim(0, quantile(diamonds$price, 0.99))
## Warning: Removed 926 rows containing missing values (geom_point).
Create a scatterplot of price vs volume (x * y * z). In the process, create a new variable for volume in the diamonds data frame.
diamonds$volume = diamonds$x * diamonds$y * diamonds$z
ggplot(data=diamonds, aes(x=volume, y=price)) + geom_point()
The x-axis needs to be rescaled to exclude the worst outliers (like 3?). Generally though, price trends upward with diamond volume.
What’s the correlation of price and volume? Exclude diamonds that have a volume of 0 or that are greater than or equal to 800.
with(subset(diamonds, (diamonds$volume > 0) & (diamonds$volume < 800)), cor.test(volume, price))
##
## Pearson's product-moment correlation
##
## data: volume and price
## t = 559.19, df = 53915, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9222944 0.9247772
## sample estimates:
## cor
## 0.9235455
Correlation: 0.9235455
Subset the data to exclude diamonds with a volume greater tha or equal to 800. Also, exclude diamonds with a volume of 0. Adjust the transparency of the points and add a linear model to the plot.
Do you think this would be a useful model to estimate the price of diamonds? Why or why not?
diam_sub = subset(diamonds, (diamonds$volume > 0) & (diamonds$volume < 800))
ggplot(data=diam_sub, aes(x=volume, y=price)) + geom_point(alpha=0.01) +
geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
This is definitely not a useful model beyond a volume of about 400. Perhaps even before then.
Use the function dplyr package to create a new data frame containing info on diamonds by clarity. Name the data frame “diamondsByClarity”. The data should contain the following variables in this order: - mean_price - median_price - min_price - max_price - n where “n” is the number of diamonds in each level of clarity
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:nlme':
##
## collapse
##
## The following objects are masked from 'package:memisc':
##
## collect, query, rename
##
## The following object is masked from 'package:MASS':
##
## select
##
## The following object is masked from 'package:GGally':
##
## nasa
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
diamondsByClarity <- diamonds %>%
group_by(clarity) %>%
summarise(mean_price = mean(price),
median_price = median(price),
min_price = min(price),
max_price = max(price),
n = n())
diamondsByClarity
## Source: local data frame [8 x 6]
##
## clarity mean_price median_price min_price max_price n
## (fctr) (dbl) (dbl) (int) (int) (int)
## 1 I1 3924.169 3344 345 18531 741
## 2 SI2 5063.029 4072 326 18804 9194
## 3 SI1 3996.001 2822 326 18818 13065
## 4 VS2 3924.989 2054 334 18823 12258
## 5 VS1 3839.455 2005 327 18795 8171
## 6 VVS2 3283.737 1311 336 18768 5066
## 7 VVS1 2523.115 1093 336 18777 3655
## 8 IF 2864.839 1080 369 18806 1790
diamonds_mp_by_clarity and diamonds_mp_by_color are summary data frames with the mean price by clarity and color. Create two bar plots on one output image using the grid.arrange() function from the package gridExtra
library(gridExtra)
diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))
diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))
plot1 = ggplot(data=diamonds_mp_by_clarity, aes(factor(clarity), y=mean_price)) + geom_bar(stat="identity")
plot2 = ggplot(data=diamonds_mp_by_color, aes(x=factor(color), y=mean_price)) + geom_bar(stat="identity")
grid.arrange(plot1, plot2, ncol=2)
What do you notice in each of the bar charts for mean price by clarity and mean price by color? clarity: I1 (worst), SI1, SI2, VS1, VS2, VVS1, VVS2, IF (best) color: J (worst) to D (best)
Generally the better the color, the lower the mean price, which is quite surprising. Similarly, aside from a couple spikes, higher clarity diamonds have lower mean prices.
This is strange. Let’s consider the mean price w.r.t. cut
diamonds_by_cut <- group_by(diamonds, cut)
diamonds_mp_by_cut <- summarise(diamonds_by_cut, mean_price = mean(price))
ggplot(data=diamonds_mp_by_cut, aes(factor(cut), y=mean_price)) + geom_bar(stat="identity")
Create a histogram of diamond prices. Facet the histogram by diamond color and use cut to color the histogram bars
ggplot(data=diamonds, aes(x=price, fill=cut)) + geom_histogram() +
facet_wrap(~color)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Create a scatterplot of diamond price vs. table and color the points by the cut of the diamond.
ggplot(data=diamonds, aes(x=table, y=price, color=cut)) +
geom_jitter() +
scale_x_continuous(breaks=seq(50, 80, 2), limits=c(50, 80))
## Warning: Removed 6 rows containing missing values (geom_point).
Create a scatterplot of diamond price vs. volume (x * y * z) and color the points by the clarity of diamonds. Use scale on the y-axis to take the log10 of price. You should also omit the top 1% of diamond volumes from the plot.
diamonds$volume = diamonds$x * diamonds$y * diamonds$z
ggplot(data=subset(diamonds, price < quantile(price, probs=0.9) & volume > 0),
aes(x=volume, y=price, color=clarity)) +
geom_jitter() +
scale_y_log10(breaks=c(300, 550, 1000, 1700, 3000, 5500, 10000)) +
xlim(0, 500)
## Warning: Removed 2 rows containing missing values (geom_point).
Create a scatter plot of the price/carat ratio of diamonds. The variable x should be assigned to cut. The points should be colored by diamond color, and the plot should be faceted by clarity.
ggplot(data=diamonds, aes(x=cut, y=price/carat)) +
geom_jitter(aes(color=color), alpha=0.2) +
facet_wrap(~ clarity)
Diamond data of price vs carat with a linear model
ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point(shape=21) +
xlim(c(0, quantile(diamonds$carat, probs=0.9))) +
ylim(c(0, quantile(diamonds$price, probs=0.9))) +
stat_smooth(method="lm")
## Warning: Removed 6369 rows containing missing values (stat_smooth).
## Warning: Removed 6369 rows containing missing values (geom_point).
## Warning: Removed 3 rows containing missing values (geom_path).
All things vs all things!
set.seed(20022012)
diamond_samp = diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, params = c(shape = I("."), outlier.shape = I(".")))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Histograms of price, with one being in log10
plt1 = ggplot(data = diamonds, aes(x = price)) + geom_histogram() +
ggtitle("Price")
plt2 = ggplot(data = diamonds, aes(x = price)) + geom_histogram(binwidth=0.01) +
scale_x_log10(breaks = c(100, 300, 1000, 3000, 10000)) +
ggtitle("Price (log10)")
grid.arrange(plt1, plt2, ncol=2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Scatter plot transformations
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point() +
scale_y_continuous(trans = log10_trans()) +
ggtitle("Price")
Create a new function to transform the carat variable
cuberoot_trans = function() trans_new("cuberoot",
transform = function(x) x^(1/3),
inverse = function(x) x^3)
Use the cuberoot_trans function
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits=c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits=c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle("Price (log10) by Cube-Root of Carat")
## Warning: Removed 1683 rows containing missing values (geom_point).
Overplotting revisited. Plot price vs carat like above, but change the points to add jitter and transparency
# head(sort(table(diamonds$carat), decreasing=T))
# head(sort(table(diamonds$price), decreasing=T))
#
ggplot(aes(carat, price), data = diamonds) +
geom_point(position="jitter", alpha=0.02) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).
Color the points by the diamond’s clarity
ggplot(aes(carat, price), data = diamonds) +
geom_point(position="jitter", alpha=0.2, aes(color=clarity)) +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).
Now color the points by the diamond’s cut
ggplot(aes(carat, price), data = diamonds) +
geom_point(position="jitter", alpha=0.2, aes(color=cut)) +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).
Lastly, color the points by the diamond’s color
ggplot(aes(carat, price), data = diamonds) +
geom_point(position="jitter", alpha=0.2, aes(color=color)) +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color', reverse = F,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).
Create a linear model for price
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data=diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## ======================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## ----------------------------------------------------------------------
## R-squared 0.924 0.935 0.939 0.951 0.984
## adj. R-squared 0.924 0.935 0.939 0.951 0.984
## sigma 0.280 0.259 0.250 0.224 0.129
## F 652012.063 387489.366 138654.523 87959.467 173791.084
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7962.499 -3631.319 -1837.416 4235.240 34091.272
## Deviance 4242.831 3613.360 3380.837 2699.212 892.214
## AIC 15930.999 7270.637 3690.832 -8442.481 -68140.544
## BIC 15957.685 7306.220 3761.997 -8317.942 -67953.736
## N 53940 53940 53940 53940 53940
## ======================================================================
Let’s get an updated diamond data set
load("BigDiamonds.Rda")
Your task is to build five linear models like Solomon did for the diamonds data set only this time you’ll use a sample of diamonds from the diamondsbig data set.
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data=subset(diamondsbig, (cert == "GIA") & (price < 10000)))
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = subset(diamondsbig,
## (cert == "GIA") & (price < 10000)))
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = subset(diamondsbig,
## (cert == "GIA") & (price < 10000)))
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = subset(diamondsbig,
## (cert == "GIA") & (price < 10000)))
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = subset(diamondsbig, (cert == "GIA") & (price < 10000)))
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = subset(diamondsbig, (cert == "GIA") & (price <
## 10000)))
##
## ===========================================================================
## m1 m2 m3 m4 m5
## ---------------------------------------------------------------------------
## (Intercept) 2.671*** 1.333*** 0.949*** 0.529*** -0.464***
## (0.003) (0.012) (0.012) (0.010) (0.009)
## I(carat^(1/3)) 5.839*** 8.243*** 8.633*** 8.110*** 8.320***
## (0.004) (0.022) (0.021) (0.017) (0.012)
## carat -1.061*** -1.223*** -0.782*** -0.763***
## (0.009) (0.009) (0.007) (0.005)
## cut: V.Good 0.120*** 0.090*** 0.071***
## (0.002) (0.001) (0.001)
## cut: Ideal 0.211*** 0.181*** 0.131***
## (0.002) (0.001) (0.001)
## color: K/L 0.123*** 0.117***
## (0.004) (0.003)
## color: J/L 0.312*** 0.318***
## (0.003) (0.002)
## color: I/L 0.451*** 0.469***
## (0.003) (0.002)
## color: H/L 0.569*** 0.602***
## (0.003) (0.002)
## color: G/L 0.633*** 0.665***
## (0.003) (0.002)
## color: F/L 0.687*** 0.723***
## (0.003) (0.002)
## color: E/L 0.729*** 0.756***
## (0.003) (0.002)
## color: D/L 0.812*** 0.827***
## (0.003) (0.002)
## clarity: I1 0.301***
## (0.006)
## clarity: SI2 0.607***
## (0.006)
## clarity: SI1 0.727***
## (0.006)
## clarity: VS2 0.836***
## (0.006)
## clarity: VS1 0.891***
## (0.006)
## clarity: VVS2 0.935***
## (0.006)
## clarity: VVS1 0.995***
## (0.006)
## clarity: IF 1.052***
## (0.006)
## ---------------------------------------------------------------------------
## R-squared 0.888 0.892 0.899 0.937 0.969
## adj. R-squared 0.888 0.892 0.899 0.937 0.969
## sigma 0.289 0.284 0.275 0.216 0.154
## F 2700903.714 1406538.330 754405.425 423311.488 521161.443
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -60137.791 -53996.269 -43339.818 37830.414 154124.270
## Deviance 28298.689 27291.534 25628.285 15874.910 7992.720
## AIC 120281.582 108000.539 86691.636 -75632.827 -308204.540
## BIC 120313.783 108043.473 86756.037 -75482.557 -307968.400
## N 338946 338946 338946 338946 338946
## ===========================================================================
Let’s use that model to predict diamond price. Example diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity = "VS1")
modelEstimate = predict(m5, newdata = thisDiamond, interval="prediction",
level = 0.95)
actualEstimate = exp(modelEstimate)[1]
estimateSig = (exp(modelEstimate)[3] - exp(modelEstimate)[2])/2
truth = 5601